Intro

We are Guy Halevi and Orr Jungerman, final-year students with a great interest in data science. We chose this workshop to extend our knowledge and practice major data science principles. It's worth noting that we live close to each other in north-central Tel-Aviv (the Old North neighborhood), a district notorious for its lack of parking options. When we were tasked with choosing a topic for this project, we wanted to tackle something important and beneficial for us, so predicting parking in some way was one of the first ideas that came to mind. Luckily, we have been harvesting parking data about a few parking lots for quite some time.

The Problem

Our experience with parking lots in Tel-Aviv led us to believe that there are common recurring patterns. For example, on weekdays most parking lots empty in the morning and fill up in the evening due to day jobs. On weekends, by contrast, the parking lots tend to empty much later, since most residents are not working or driving back to their families. To back our experience with real and objective data, we have been collecting the vacancy status of popular parking lots for the past 2 years, on a per-minute basis. We believe we can use this data to accurately predict the status of each parking lot, helping us plan our schedule ahead. We expect to find interesting anomalies due to holidays or special events (such as lockdowns, rain, etc.). To deal with such anomalies, we will enrich the data with more information, such as recurring holidays.

Our Data

The data we collected comes in a CSV format, where each row contains the vacancy status of 6 different parking lots at a specific time. The parking lots are: Basel, Asuta, Sheraton, Dan, Dubnov and Habima. There are 6 different vacancy statuses: Free - a significant number of spots; Few - the parking lot is almost full; Full - no spots at all; Active - no indication at that time; Unknown/NaN - we had trouble collecting the status. We have been collecting data since July 2019, at minute resolution. We managed to collect the data by writing a simple script that parses the website of each parking lot, for example http://www.ahuzot.co.il/Parking/ParkingDetails/?ID=3.

The following map shows the locations of the 6 lots:


Exploratory Data Analysis

Let's start by viewing the raw data.
We will use pandas to load the CSV into a dataframe:
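As a minimal sketch of this loading step (the real file name and exact column layout are our assumptions, not the original file), it might look like:

```python
import io
import pandas as pd

# Hypothetical sample mirroring the scraped CSV layout described above.
raw_csv = io.StringIO(
    "timestamp,Basel,Asuta,Sheraton,Dan,Dubnov,Habima\n"
    "2019-07-21 08:00,Free,Few,Full,Free,Free,Free\n"
    "2019-07-21 08:01,Free,Few,Full,Free,Free,Free\n"
)
df = pd.read_csv(raw_csv)  # in the real notebook: pd.read_csv("parking.csv")
print(df.shape)  # (2, 7): one timestamp column plus one column per lot
```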

Data Assessment

First, we will copy our raw data, so we can modify the copy without altering the original data.

Let's change the column names to more standard and convenient names, and set the right type for each column so we can use it later.
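A small sketch of this cleanup, assuming hypothetical original column names:

```python
import pandas as pd

# Toy frame standing in for the raw data; "Time" is an assumed original name.
raw = pd.DataFrame({
    "Time": ["2019-07-21 08:00", "2019-07-21 08:01"],
    "Basel": ["Free", "Few"],
})
df = raw.rename(columns={"Time": "datetime"})
df["datetime"] = pd.to_datetime(df["datetime"])  # string -> datetime64
df["Basel"] = df["Basel"].astype("category")     # statuses as categorical
print(df.dtypes["datetime"])  # datetime64[ns]
```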

As you can see, the datetime column is now recognized as a date object instead of a simple string.

Now, let's explore the "parking statuses", or classes in our terminology:

We can see that in addition to the 3 main statuses <Free, Few, Full>, there are 3 additional "unknown" statuses that are quite rare and that we want to exclude from our dataset - <NaN, Active, Unknown>.

In addition to excluding the 3 "unknown" statuses, we need to decide how to handle the 3 "main" statuses. The meaning of Few is that there are some parking spots available, but just a few. It means that in some cases, if we had checked the status a few seconds earlier or later than we actually did, we might have gotten different results. So, it makes sense to smooth the data. For example, if the status was Full for a while, then Few for a few minutes and then Full again, it makes more sense to translate this Few to Full than to Free, and the same for the other way around.

In order to do that, we will start by treating Few as "the middle between Free and Full"; in other words, we will translate Free -> 1, Few -> 0.5, Full -> 0, as Free has 1 (a lot of) free parking, Few has 0.5 (a few) free spots and Full has 0 (no) free parking.
Later we will normalize each 0.5 point to either 0 or 1, depending on its surroundings.
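The translation above can be sketched with a simple mapping (the variable names here are illustrative):

```python
import pandas as pd

# Free -> 1 (lots of spots), Few -> 0.5 (the middle), Full -> 0 (no spots).
status_to_score = {"Free": 1.0, "Few": 0.5, "Full": 0.0}
s = pd.Series(["Full", "Few", "Full", "Free"])
scores = s.map(status_to_score)
print(scores.tolist())  # [0.0, 0.5, 0.0, 1.0]
```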

In order to work with the data, it would be much more convenient if each row had only one class.
In addition, we will remove rows with the class NaN so that our data is clean. We didn't do it before to avoid removing a complete row because of one missing value at one parking lot.
We use melt() to do that:
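A sketch of this melt step on toy data (column names follow the description above; they are assumptions about the real notebook):

```python
import pandas as pd

wide = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-07-21 08:00", "2019-07-21 08:01"]),
    "Basel": ["Free", "Few"],
    "Dubnov": ["Full", None],  # a missing status at one lot only
})
# One (datetime, lot, class) row per observation; drop NaN per-row, not per-timestamp.
long = wide.melt(id_vars="datetime", var_name="lot", value_name="class")
long = long.dropna(subset=["class"])
print(len(long))  # 3: the single NaN is removed without losing the other lot's row
```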

Data Resampling

Our dataset is incomplete - sometimes the scraper didn't run, and sometimes there were problems with the API. However, we can try to recover some of the lost data by resampling and applying a moving average. This approach will also help us reduce noise and small anomalies.
With the moving average, the discrete availability class becomes a continuous variable representing the likelihood of parking availability at that point in time.
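A minimal sketch of the resample-then-smooth idea (window size and toy values are assumptions):

```python
import pandas as pd

# One lot's numeric scores at minute resolution, with a missing minute at 08:02.
idx = pd.to_datetime(["2019-07-21 08:00", "2019-07-21 08:01", "2019-07-21 08:03"])
scores = pd.Series([1.0, 0.5, 0.0], index=idx)
# Resample to a regular 1-minute grid (the gap becomes NaN), then smooth
# with a centered moving average that skips the NaNs.
regular = scores.resample("1min").mean()
smoothed = regular.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed.round(2).tolist())  # [0.75, 0.75, 0.25, 0.0]
```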

Threshold and Downsample

After applying the moving average that produced the availability feature, we'd like to define a threshold for parking status classification. Since the Few status was translated to 0.5, we believe it should act as the threshold number too.
Every hour whose average is lower than 0.5 is an hour that was Full most of the time and will be normalized to 0, and every hour whose average is higher than 0.5 will be normalized to 1.
Regarding hours with an average of exactly 0.5, we decided to treat them as Free (i.e. 1), as it means that for a whole hour, most of the time there were "at least a few spots available".
For computational reasons and to simplify the data analysis, we decided to downsample the dataset to a 1-hour resolution.
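Sketched on toy minute-level data (an hour that is Full most of the time), the threshold-and-downsample step might look like:

```python
import pandas as pd

# One synthetic hour: repeating [Free, Few, Full, Full] at minute resolution.
minute_scores = pd.Series(
    [1.0, 0.5, 0.0, 0.0] * 15,
    index=pd.date_range("2019-07-21 08:00", periods=60, freq="1min"),
)
hourly = minute_scores.resample("1h").mean()
# >= 0.5 counts as Free (1); an exact 0.5 average means "at least a few
# spots available" most of the time, so it also maps to Free.
hourly_class = (hourly >= 0.5).astype(int)
print(hourly.iloc[0], hourly_class.iloc[0])  # 0.375 0 -> mostly Full
```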

Data Visualization

Now comes the part where we want to visualize our data. The visualizations are supposed to give us a grasp of our data and validate our hypothesis.

Let's start by visualizing the class distribution among the different lots.
We will use a color palette that gives us the right intuition - 1 is "good" (there are free parking spots!), therefore green, and 0 is "bad", therefore red.

As the histogram shows, some lots have relatively equally distributed classes (e.g. Basel), while some have almost only one class (e.g. Habima).

We can also see that Free is the most common class in all lots.
This is already interesting information for us personally, especially regarding Basel, the lot closest to where we both live.


We'd like to mention how difficult it is to visualize this data in a way that makes sense and is intuitive. We tried many methods, but all of them showed too much information that wasn't helpful in any way.

For example, let's show classes over time, for one specific month - January 2020:

As you can see, even after filtering out most of the data and keeping only one month (out of the ~2 years we have!), it is still too much to visualize without any special treatment.

Then we came up with the idea of weekly patterns. As mentioned before, we believe the parking vacancy status recurs every week. Let's transform our data to validate this hypothesis.

Weekly Patterns

From our own knowledge and experience, we know that available parking spots change over the week in a periodic way that is very similar between different weeks. For example, every weekday morning there is quite a lot of parking, and every evening there isn't. On weekends it changes; Friday morning is busy, but Friday evening is always super free (probably because most young people visit their parents outside of Tel Aviv).

It's a very interesting hypothesis to start with.
Let's use FFT (Fast Fourier Transform) to prove that these patterns actually exist:
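A sketch of this FFT check on a synthetic series (the real notebook runs it on the actual availability data; the daily-cycle signal here is a stand-in):

```python
import numpy as np

# Synthetic hourly availability with a daily cycle, standing in for one lot.
hours = np.arange(24 * 7 * 8)  # 8 weeks of hourly samples
signal = 0.5 + 0.4 * np.sin(2 * np.pi * hours / 24)
# Remove the mean so the DC component doesn't dominate the spectrum.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(len(signal), d=1.0)  # cycles per hour
dominant_period = 1.0 / freqs[spectrum.argmax()]  # in hours
print(dominant_period)  # 24.0 -> the daily pattern shows up as a spike
```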

FFT

The FFT diagram definitely shows that there is a significant recurring pattern every day (which means that the time of day is very correlated to the parking availability status) and a smaller spike, yet still visible and significant, every week (which means that the day of the week is correlated with the status).

We expected the weekly pattern to be more significant, but we assume it is weaker than expected because all 5 weekdays are quite similar to each other, and only the weekends differ.


Let's group the data by weeks (every week is defined as 00:00 on Sunday until 23:00 on Saturday). To simplify the data, we drop a few rows that are before the first whole week we have data for:

Now that we have week information for each row, we can start showing weekly trends.

Now the visualization is much better and it actually starts to make sense!
We finally have our first informative visualization, that supports our hypothesis of weekly trends.

Note the width of the lines. The wider the line is, the more variation there is in the classes at that point. We can see that all lines are relatively narrow, especially when compared to the earlier unsuccessful visualization.
The meaning of the narrow lines is that for every hour in a week, its classes are quite similar between all weeks in our data.

Even though it's a pretty good start, the visualization does not prove that the data is completely predictable using the lot and day_hour features only. It only shows that they are well correlated.

Now let's take a look from another point of view.

Instead of showing the average parking per hour in the week, among all weeks, let's show the average parking per week over time.
To do so, we need to have the average parking status per week over time.

We define the parking status of a week as the average of all hours within the week.
It's important to note that our class will no longer be one of 2 discrete options - it will become a continuous number. We will only use this data for this visualization and then come back to the earlier per-hour format.

As we can see, there have been some changes over time.
There are several specific things to take from this view.

The first one is that there is a big gap without any data points somewhere between 2020-05 and 2020-07. Not only that, but the first week after the gap is extremely different than usual, mostly for the worse but not only, and it differs a lot among the different lots, as shown by the width of the top line.

The second is the variation around 2019-11, as shown by the width of the top line. Some of the parking lots had much more space than usual, and some much less.

The third is the two peak points around 2020-04 and 2020-10, in which all lots had an unusually large number of free parking spots.

We will try to use these insights later in this project, and will also try to explain them.

Finally, we had an idea for a visualization that combines most of the data and visualizes it in a way that makes much sense and is very intuitive.
The idea is to show every week of a specific parking lot as a row of points, each colored as before (green means Free, red means Full). To show trends over time, we stack weeks above one another, so that the lowest row is the earliest week:

Perfect! We can see here most insights from both visualizations above.

The weekly trends are vertical trends (green / red "columns"), and changes over time are horizontal trends, like complete rows (weeks) that are entirely green, demonstrating whole weeks with only Free statuses.

Classification Using Basic Data

Before we begin, we need to choose the features we want, and handle their types.
Currently, the features we have look like this:

If we want the models to be able to use the inner information of datetime, we need to create columns describing the day and the time of the day.

For day, the simplest option is a one-hot representation, with a column for every day of the week (using pd.get_dummies).
For hour, the simplest option is simply the numeric representation between 0-23.
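A sketch of both encodings together (the exact feature-frame layout is our assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-01-05 08:00", "2020-01-10 18:00"]),
})
df["hour"] = df["datetime"].dt.hour  # numeric 0-23
# One-hot day of week via pd.get_dummies, as described above.
day_dummies = pd.get_dummies(df["datetime"].dt.day_name())
features = pd.concat([df[["hour"]], day_dummies], axis=1)
print(list(features.columns))  # ['hour', 'Friday', 'Sunday'] for these two rows
```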

When handling a classification task, it is important to split the train and test data wisely.

Usually, in non-time-based problems, the best practice is to split the data randomly.
However, in our case, we want to simulate reality, in which we have prior data and we want to predict the future.

For this reason, we start by splitting the data in a way that train is all data before 31/12/2020, and test is all the data from 2021.

In addition, we split our data by parking lot name, so we can train a different model on each lot separately.
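The date-based split and the per-lot split can be sketched like this (toy data; the cutoff date follows the text above):

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-06-01", "2020-12-30", "2021-01-02", "2021-02-01"]),
    "lot": ["Basel", "Basel", "Basel", "Dubnov"],
    "class": [1, 0, 1, 1],
})
# Train on everything before 2021, test on 2021 onward.
split_point = pd.Timestamp("2021-01-01")
train = df[df["datetime"] < split_point]
test = df[df["datetime"] >= split_point]
# One training frame per lot, for per-lot models.
per_lot_train = {lot: g for lot, g in train.groupby("lot")}
print(len(train), len(test))  # 2 2
```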

Initial LogisticRegression

The accuracy results of this basic logistic regression are very diverse: the accuracy in Basel is no better than random (and worse than always choosing the most common class), while the one in Dubnov is way higher than it's fair to expect.

When you think about it, there are a few quite clear reasons for that.
The first and main reason is that we chose a random point in time and measured ourselves according to this single data split. This point in time is much too meaningful - if we chose another point in time, we would get completely different results.
The second reason is that we used less than 2/3 of the data for training, which might not have been enough. We should use more data for training and test ourselves on less.
The third is related to the first and is specific to Dubnov. It seems that the Dubnov lot was nearly always free during 2021 so far, so a simple classifier that always predicts Free would have almost 100% accuracy. This is not true for the period before 2021, though.

Better Data Split

To avoid giving too much of a meaning to one chosen splitting point, we will use a technique similar to the classic k-folds cross validation.

We will leave some of the data aside as the test set, and use all other data for training and validating the results.
The train-validation data will then be used in 10 different splits, each containing a 12-month train set and a 1-month validation set consisting of the month that follows the training year.

Each model will be trained and tested 10 different times (independently), and its score will be the average accuracy score of all 10 times.
After comparing different models and different parameters, and choosing the best one according to its scores over the train-validation data, we will train the chosen model on all data before the test set starts, and evaluate its results on the test set.
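The rolling-fold scheme above can be sketched as follows (a simplified illustration with daily rows and 2 folds; function and variable names are ours):

```python
import pandas as pd

def rolling_folds(df, n_folds=10, train_months=12):
    """Yield (train, validation) pairs: each fold trains on a 12-month window
    and validates on the single month that follows it."""
    start = df["datetime"].min().to_period("M")
    for i in range(n_folds):
        train_start = (start + i).to_timestamp()
        val_start = (start + i + train_months).to_timestamp()
        val_end = (start + i + train_months + 1).to_timestamp()
        train = df[(df["datetime"] >= train_start) & (df["datetime"] < val_start)]
        val = df[(df["datetime"] >= val_start) & (df["datetime"] < val_end)]
        yield train, val

df = pd.DataFrame({"datetime": pd.date_range("2019-07-01", "2021-03-31", freq="1D")})
folds = list(rolling_folds(df, n_folds=2))
print([(len(t), len(v)) for t, v in folds])  # ~366-day trains, 31-day validations
```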

As we planned, a typical fold has ~12 times more data in its train set than in its validation set, and the final test set has a similar amount of data to a typical validation set.

Now let's run the same LogisticRegression model and see its average validation score.

Now it makes much more sense!
For example, Basel's score is much higher than 50%, and Dubnov's is not 99.9% anymore.


Now that we have a good way to measure the models, let's try DecisionTree and RandomForest and compare them to the initial LogisticRegression.

Initial DecisionTreeClassifier

Example DecisionTree Decision Making

This graph describes the decision process of the example Decision Tree model.
Each node contains a classifying criterion such as the hour or the day of the week. The root of the graph is the starting point, and we follow the classification rules down through the nodes. We go left when the criterion is met, and right if not.
For example, let's say we want to predict the status on Friday evening (6pm). The first node splits by the threshold hour <= 17.5. Since our hour does not meet this criterion, we take the right path. Then the criterion is Friday <= 0.5 ("not Friday"), so we go right again. Finally we reach a leaf that tells us our predicted class is Free.
Each node also contains extra information about the certainty of the decision. The Gini score measures the node's impurity - the probability that a randomly chosen sample from the node would be misclassified if it were labeled according to the node's class distribution. A Gini score of 0 means the node contains only a single class. Samples is the percentage of the training data that reached the corresponding node. Value describes how that training data is distributed among the classes.

Initial RandomForestClassifier

Let's now take it another step forward, and compare more models with more parameter combinations to find the best one for each lot:

Choosing the Best Classifier

As the results show, there is no single best classifier - every lot has its own best configuration.

This score is amazing compared to the initial logistic regression we started with. By changing nothing but the models and their params, the accuracy score improved by almost 8%!

Advanced Time Normalization

The first improvement we want to try is to change the representation of the hour and the day from a numeric value of 0-23 and a one-hot representation to a continuous representation in which the difference between hours 23 and 1 is the same as the difference between hours 1 and 3, and the same with the days of the week.
To do so, we will create a sine and a cosine of the hour and of the day, and instead of saving the hour or the day directly, we will save a point on the wave that represents the same information.
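A small sketch of the cyclic encoding, showing that the wrap-around distances become equal:

```python
import numpy as np
import pandas as pd

hours = pd.Series([23, 1, 3])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
# Distance on the circle: 23 -> 1 is now exactly as close as 1 -> 3.
d_23_1 = np.hypot(hour_sin[0] - hour_sin[1], hour_cos[0] - hour_cos[1])
d_1_3 = np.hypot(hour_sin[1] - hour_sin[2], hour_cos[1] - hour_cos[2])
print(round(d_23_1, 4), round(d_1_3, 4))  # equal chord lengths
```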

The results are surprising - the original representation of the days and hours led to higher accuracy. So, we will go back to using the original representation for now.

Adding More Data

In order to enhance our results, we want to add as much relevant information as possible to the dataset.
We will use the following data:

All the data was taken from the internet using some good search and crawling methods that won't be detailed here, and saved to different CSV files.

We are going to use each of the files separately, find its impact, and in the end, merge everything together.

Weather

The weather dataset includes a lot of information, most of which seems irrelevant or duplicated. Our gut feeling was that rain would have a high impact on the parking ratio, and maybe temperature as well, as people change habits in different weather conditions - for example, going to the beach (and parking near it) more as the weather gets hotter.

To explore the data, let's first merge it with our main dataframe and look for correlations with class column:

The results are surprising! Heat Index and Cloud Cover are more correlated with class than Conditions_Rain! The next step is to actually use the data to enhance the classification. To do that, we will choose some of the features. We will use only features that have a decent correlation, a decent existence ratio, and make at least some sense to us (as opposed to Wind Direction, which simply does not).
Therefore we will only take the features Heat Index, Cloud Cover and Conditions. Temperature has such a low correlation that it will not be included!

To get rid of missing values, we will fill them with the minimum value of each column.
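This fill step is a one-liner in pandas; a sketch on toy weather values:

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    "Heat Index": [30.0, np.nan, 28.0],
    "Cloud Cover": [np.nan, 0.2, 0.8],
})
# fillna with a Series of per-column minimums fills each column independently.
filled = weather.fillna(weather.min())
print(filled["Heat Index"].tolist())  # [30.0, 28.0, 28.0]
```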

There we go! The accuracy improved by ~0.1% thanks to the new data.

Day / night hours

The day / night dataset includes the exact minute of sunrise and sunset for each day. Our initial expectation was that it would have a high correlation with our class, because people change their habits according to the daylight hours, especially in the evening (i.e. we expect sunset to be more informative than sunrise).

To be able to use the data, let's normalize it, merge with our main dataframe and check its correlation with class:

Surprisingly, sunrise is the one with a much higher correlation than sunset. Let's see how it helps our classifiers:

It seems that this new information only confuses the classifiers. Let's double check that adding sunset back doesn't help:

Looks like we made the right decision leaving sunset out, but either way, the new data does not help. We will leave it out.

Holidays

What we have here is a list of all holidays in the relevant date range, including a broad range of what can be called holidays:

Let's merge the dataframe with our main one and see how the class changes during different holidays:

Super interesting! It looks very indicative - erev_hag is mainly free, all_vacation and hol_hamoed are similar but a bit less so, and elections are mainly full, which is very unusual.

Let's validate our conclusions with correlation scores:

Alright, the results show what we expected and we are ready to use it in our classification.

Let's get rid of the non-correlated ones and continue:

As expected, the data did help the results. Not as much as we expected, but any accuracy increase is good news.

The next step is, obviously, to merge the good parts of each dataset together and see the results:

Merging

The final result is certainly higher than the "initial best classifier"!

We are now ready to test our final classifier on the test data.

Evaluating on TEST

These results are amazing. Not only is the average score pretty high, but it is also very consistent with our validation results, which means we didn't overfit the training and validation data.

Deep Learning

Let's utilize keras deep learning models to improve our prediction accuracy!

Classification with Deep Learning

Before trying some fancy deep learning models, we should start with a simple DNN that attempts to predict parking availability status by our time features (sin/cos of day/week).

Linear DNN classifier

Great, we got a decent prediction accuracy with the linear model. We might get better results if we add a hidden layer.

Dense DNN model

We got some really nice results! Let's see what happens when we add more hidden layers with dropout to prevent overfitting.

Adding more and more layers did not necessarily improve accuracy. Some lots improved while others didn't. Let's try adding more features to the model, like we did before.

Overall, it seems like adding more features confused our model. Lots like Basel and Asuta are usually not affected by weather or vacations, and we got less accurate predictions there. On the other hand, seasonal lots located near attractions, such as Sheraton, Dan and Habima, got better predictions.
Let's evaluate our best model on test data.

Time series forecasting with DNN

So far we only classified data points based on their features, without the context of their order in time.
In the next section, we will create models that forecast new timesteps based on prior information.
The model's input will be [input_timesteps, labels+features] and the output will be [output_timesteps, labels].
For example, a model that forecasts 3 hours into the future based on 6 hours with 4 features (sin/cos of day/week) and 1 label (class) would have an input shape of [6, 5] and an output shape of [3, 1].
To build such dataset, we will use tf.keras.preprocessing.timeseries_dataset_from_array that breaks the dataset into batches of consecutive timeseries with predefined lengths. Then we will split the input from the outputs.
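The windowing that `timeseries_dataset_from_array` performs can be sketched in plain numpy (a simplified stand-in, not the keras implementation):

```python
import numpy as np

def make_windows(data, input_steps, output_steps, label_col):
    """Split a (time, features) array into (X, y) windows:
    X has shape (n, input_steps, n_features); y has (n, output_steps, 1)."""
    n = len(data) - input_steps - output_steps + 1
    X = np.stack([data[i:i + input_steps] for i in range(n)])
    y = np.stack([data[i + input_steps:i + input_steps + output_steps, [label_col]]
                  for i in range(n)])
    return X, y

data = np.arange(20, dtype=float).reshape(10, 2)  # 10 timesteps, 2 columns
X, y = make_windows(data, input_steps=6, output_steps=3, label_col=1)
print(X.shape, y.shape)  # (2, 6, 2) (2, 3, 1)
```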

Single Step prediction

Let's start by predicting a data point just like before, but this time the model will have access to the previous hours.

Baseline Repeat

Before we create deep learning models, let's create a benchmark with a simple model that just repeats its input as the output.

First, let's run the repeat baseline with a 1-hour prediction. The model gets the features and label of one hour and predicts the label of the next hour.
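The repeat baseline is simple enough to sketch fully (toy 0/1 series; the accuracy shown is for this toy data only):

```python
import numpy as np

def repeat_baseline(last_inputs):
    """Predict that the next hour's class equals the last observed class."""
    return last_inputs

classes = np.array([1, 1, 0, 0, 0, 1, 1, 1])  # toy hourly 0/1 availability
preds = repeat_baseline(classes[:-1])          # predict hour t from hour t-1
accuracy = (preds == classes[1:]).mean()
print(accuracy)  # 5 out of 7 transitions are unchanged -> ~0.714
```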

Great! We got really nice results here. Let's see if we can push this further using a deep learning model.

That's interesting - the simple DNN model could not beat the baseline model. Our model found a local optimum that was not good enough. Let's dig into our model and see what it learned:

As expected, the most significant feature used for prediction was the last hour's label (class).
Let's see what happens when we use a different optimizer, such as SGD.

This time, the model scored virtually the same as our baseline. Peeking into its weights, we can see it gave the class feature more significance.

Let's try again with another hidden layer and activation.

Mostly the same results as baseline.

Multistep forecast

Next, we will attempt to predict multiple points instead of a single prediction.

Baseline

Let's see how well our baseline scores when it needs to predict a whole day based on its previous one.

The baseline model reached 82% accuracy! Like before, let's try to improve on it using deep learning models.

Simple DNN

This is great! Even though our model is still pretty basic, we achieved pretty decent accuracy!

CNN Model

Next, we will try to improve accuracy by granting our model access to the features of multiple timesteps at once.
The input of our model is a 3D tensor with a shape of (batch, timestep, features). Dense layers operate only on the 3rd axis, so different timesteps are tuned separately.
We can use Flatten or Conv1D layers to operate on multiple timesteps simultaneously.

Adding a CNN layer did not improve our accuracy at all. Let's try another approach and use an LSTM layer, which learns "trends" with fewer trainable variables.

Still not improving. The LSTM layer could be hiding the y-features (sin/cos of day/week) we injected into our dataset. To overcome this issue, we can come up with a non-sequential model that feeds the following Dense layer with the original y-features input.
To do so, we will create a custom class that derives from the keras Model class and write our own logic.

We managed to improve our 24h prediction accuracy even more! Let's see how we're doing so far:

Autoregressive Models

In the previous section, we used single-shot models that predicted multiple steps simultaneously.
Another approach we can try is to create a self-feeding model that predicts multiple steps, but one step at a time.
For example, we can write a simple AR model based on a simple dense layer with the following logic:
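The self-feeding loop itself can be sketched without keras, using a dummy "model" in place of the trained dense layer (the function and the dummy model are illustrative, not our actual implementation):

```python
import numpy as np

def autoregressive_forecast(model, history, steps):
    """Feed the model its own predictions, one step at a time."""
    window = list(history)
    preds = []
    for _ in range(steps):
        nxt = model(np.array(window))  # predict the next step from the window
        preds.append(nxt)
        window = window[1:] + [nxt]    # slide the window forward by one step
    return preds

# Dummy stand-in model: the mean of the window, thresholded to a 0/1 class.
model = lambda w: int(w.mean() >= 0.5)
print(autoregressive_forecast(model, [1, 1, 0, 1], steps=3))  # [1, 1, 1]
```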

Since our model returns features as well as labels (instead of just the single label - class), we need to customize our metrics and loss functions to ignore the irrelevant outputs.

Amazing! We pushed further and are almost reaching 88%. This score seems promising, so we will evaluate it on the test data.

Next, we will try to run an LSTM model with the same autoregressive approach. We will create a warm-up step to prepare our LSTM state, and then continue by feeding this model its own predictions.

As we saw before, the LSTM could potentially hide important features such as the y-features (day/week sin/cos), so we feed our dense layer directly with them.

We improved a little bit, but we're still behind the CNN-Dense AR model. Let's evaluate this model on the test set.

Classifier Results

We tackled our data with lots of classification approaches. Let's summarize the results!

Validation Results

Test Results

Accuracy and Anomalies

We can compare our predictions to the true values to find special patterns and anomalies, which could help us improve in the future.
First, let's take the best sklearn classifiers we found before and train them on all our data.

Then, we can create a special dataframe with a success column.

Visualize Accuracy

Just like we did at the beginning, we can use our data exploration visualizations to learn about our accuracy.

On most of the parking lots, we can spot some patterns of missed predictions. It seems the pattern repeats on certain hours and days.
Let's continue with our exploration and look for weekly patterns.

Interesting. We definitely see some hourly patterns here! Let's zoom in and group by the hour.

Now we've got it! We can clearly see the unique patterns each lot has!
Let's see if we can spot any anomalies over time.

We can find some interesting anomalies in SOME of the data. For example, the week of 21/03/20 was the beginning of the first lockdown.

Using Our Deep Learning Models

We can sift out some minor anomalies if we cross-check our DNN models against sklearn's models.
First, we need to run and evaluate our model on all our data.

We can create a unified dataframe with sklearn's predictions and see where both models did not succeed.

Like before, let's run the same visualizations, this time on any_success, so we will get better results.

Great - we can see that there is less random noise, and we can definitely spot the existence of patterns. Now let's focus on the weekly and hourly patterns:

Here we go! This time we can see that Basel and Asuta were harder to predict specifically very early in the morning, when someone leaves their parking spot and there is barely any traffic, and around 6-7pm, when people come back from work and the parking lots are filling up.

Over time, we can spot some anomalies, usually around holidays. Moreover, we can see some spikes in uncertainty around April 2020, September 2020 and December 2020 - months with heavy covid restrictions!

Final Words

We had a great time working on this interesting project! Each and every aspect of it taught us new skills. We practiced data visualizations, worked with classifiers and tweaked some deep learning models. We also learned techniques to cross-validate our results to reduce bias.
Even though we decided to end this project here, we believe there are many directions we could continue exploring. We might be able to find correlations with other data, such as NLP analysis of news. There is also much room for improvement in our models - we only tested a few configurations, with not-so-effective computational efficiency. We could push our anomaly detection even further and create 2nd-degree models that predict anomalies using our trained models!
Since we're staying in Tel-Aviv, we will keep collecting data and continue this project in our free time. We already have almost two years of data, and it would be worth revisiting once we have even more.